(57) A low-complexity method and apparatus for performing waveform interpolation in a low bit-rate WI speech decoder, wherein interpolation between received waveforms is performed with use of spline coefficients generated based thereupon. Specifically, two signals are received from a WI encoder, each comprising a set of frequency domain parameters representing a speech signal segment of a corresponding pitch period. Then, spline coefficients are generated from each of the received signals, wherein each set of spline coefficients comprises a spline representation of a time domain transformation of the corresponding set of frequency domain parameters. Finally, the decoder interpolates between the spline representations to generate interpolated time domain data which is used to synthesize a reconstructed speech signal. In certain embodiments of the present invention, the time scale of at least one of the spline representations is modified to enable the interpolation therebetween. Also, in accordance with one illustrative embodiment of the present invention, a cubic spline representation is used, while in accordance with another illustrative embodiment, a novel variant of a cardinal spline representation is advantageously employed.
Description

Field of the Invention 

The present invention relates generally to the field of low bit-rate speech coding, and more particularly to a method and apparatus for performing low bit-rate speech coding with reduced complexity.

Background of the Invention

Communication of speech information often involves transmitting electrical signals which represent speech over a channel or network ("channel"). A problem commonly encountered in speech communication is how to transmit speech through a channel of limited capacity or bandwidth. (In modern digital communications systems, bandwidth is often expressed in terms of bit-rate.) The problem of limited channel bandwidth is usually addressed by the application of a speech coding system, which compresses a speech signal to meet channel bandwidth requirements. Speech coding systems include an encoder, which converts speech signals into code words for transmission over a channel, and a decoder, which reconstructs speech from received code words.

As a general matter, a goal of most speech coding systems concomitant with that of signal compression is the faithful reproduction of original speech sounds, such as, e.g., voiced speech. Voiced speech is produced when a speaker's vocal cords are tensed and vibrating quasi-periodically. In the time domain, a voiced speech signal appears as a succession of similar but slowly evolving waveforms referred to as pitch-cycles. Each pitch-cycle has a duration referred to as a pitch-period. Like the pitch-cycle waveform itself, the pitch-period generally varies slowly from one pitch-cycle to the next.

Many speech coding systems which operate at bit-rates around 8 kilobits per second (kbps) code original speech waveforms by exploiting knowledge of the speech generation process. Illustrative of these so-called waveform coders are the code-excited linear prediction (CELP) speech coding systems, which code a speech waveform by filtering it with a time-varying linear prediction (LP) filter to produce a residual speech signal. During voiced speech, the residual signal comprises a series of pitch-cycles, each of which includes a major transient referred to as a pitch-pulse and a series of lower amplitude vibrations surrounding it. The residual signal is represented by the CELP system as a concatenation of scaled fixed-length vectors from a codebook. To achieve a high coding efficiency of voiced speech, most implementations of CELP also include a long-term predictor (or adaptive codebook) to facilitate reconstruction of a communicated signal with appropriate periodicity. Despite improvements over time, however, many waveform coding systems suffer from perceptually significant distortion when operating at rates below 6 kbps. This distortion is typically characterized as noise.

Low bit-rate coding systems which operate, for example, at rates of 2.4 kbps are generally parametric in nature. That is, they operate by transmitting parameters describing the pitch-period and the spectral envelope (or formants) of the speech signal at regular intervals. Illustrative of these so-called parametric coders is the LP vocoder system. LP vocoders model a voiced speech signal with a single pulse per pitch period. This basic technique may be augmented to include transmission of information about the spectral envelope, among other things. Although LP vocoders provide reasonable performance generally, they also may introduce perceptually significant distortion, typically characterized as buzziness.

The types of distortion discussed above, and another (reverberation) common in sinusoidal coding systems, are generally the result of a reconstructed speech signal which lacks (in whole or in significant part) the pitch-cycle dynamics found in original voiced speech. Naturally, these types of distortion are more pronounced at lower bit-rates, as the ability of speech coding systems to code information about speech dynamics decreases. These problems have been addressed, and significant progress has recently been achieved in low-rate speech coding, with the introduction of algorithms based on waveform interpolation and associated signal modeling techniques. The general idea behind these techniques is to try to synthesize a coded signal that mimics the natural evolution of the original speech, while sending as little information as possible about the original signal. This idea is based on the observation that speech usually carries slowly varying attributes that may be sampled and interpolated at low rates. A significant amount of information in the signal can be discarded, as long as certain key features are faithfully regenerated.

The main techniques used in accomplishing this task are waveform interpolation (WI) and signal decomposition (SD). WI is used in the synthesis process (i.e., in the decoder) to maintain the degree of smoothness usually observed in speech signals, particularly in voiced regions. Maintaining smoothness increases the robustness to coding distortions. As an example, larger errors in pitch can be perceptually tolerated if the pitch varies smoothly rather than abruptly (unnaturally). The same is true for other types of distortions. SD enables the coding system to focus on the more important signal domains, discarding information carried in less important ones. WI coders are described, for example, in Y. Shoham, "High-quality speech coding at 2.4 to 4.0 kbps based on time-frequency interpolation," Proc. ICASSP '93, pp. II-167-170; Y. Shoham, "High-quality speech coding at 2.4 kbps based on time-frequency interpolation," Proc. Eurospeech '93, pp. 741-744; W.B. Kleijn et al., "A speech coder based on decomposition of characteristic waveforms," Proc. ICASSP '95, pp. 508-511; and W.B. Kleijn et al., "A low-complexity waveform interpolation coder," Proc. ICASSP '96, pp. 212-215. WI coders are also described in commonly owned U.S. Patent No. 5,517,595, entitled "Decomposition in Noise and Periodic Signal Waveforms in Waveform Interpolation," issued to W.B. Kleijn on May 14, 1996.

Although WI coders generally produce reasonably good quality reconstructed speech at low bit rates, the complexity of these prior art coders is often too high to be commercially viable for use, for example, in low-cost terminals. Therefore, it would be desirable if a WI coder were available having substantially less complexity than that of prior art WI coders, while maintaining an adequate level of performance (i.e., with respect to the quality of the reconstructed speech).

Summary of the Invention 

In accordance with the present invention, an improved method and apparatus for performing waveform interpolation in a low bit-rate WI speech decoder is provided, wherein interpolation between received waveforms is performed with use of spline coefficients generated based thereupon. Specifically, two signals are received from a WI encoder, each comprising a set of frequency domain parameters representing a speech signal segment of a corresponding pitch period. Then, spline coefficients are generated from each of the received signals, wherein each set of spline coefficients comprises a spline representation of a time domain transformation of the corresponding set of frequency domain parameters. And, finally, the decoder interpolates between the spline representations to generate interpolated time domain data which is used to synthesize a reconstructed speech signal. In certain embodiments of the present invention, the time scale of at least one of the spline representations is modified to enable the interpolation therebetween. Also, in accordance with one illustrative embodiment of the present invention, a cubic spline representation is used, while in accordance with another illustrative embodiment, a novel variant of a cardinal spline representation is advantageously employed.

Brief Description of the Drawings 

Figure 1 shows a surface comprising a series of smoothly evolving waveforms as may be advantageously produced 
by a waveform interpolation coder. 

Figure 2 shows a block diagram of a conventional waveform interpolation coder. 

Figure 3 shows a block diagram of waveform interpolation based on a cubic spline representation in accordance 
with a first illustrative embodiment of the present invention. 

Figure 4 shows a block diagram of waveform interpolation based on a pseudo cardinal spline representation in 
accordance with a second illustrative embodiment of the present invention. 

Figure 5 shows an illustrative set of smoothed spectra for a random spectrum codebook of a waveform interpolation coder.

Figure 6 shows a block diagram of a low-complexity waveform interpolation coder in accordance with an illustrative embodiment of the present invention.

Detailed Description 

A. Overview of Waveform Interpolation

The WI method is based on processing a time sequence of spectra. A spectrum in such a sequence may, for
example, be a phase-relaxed discrete Fourier transform (DFT) of a pitch-long snapshot of the speech signal. Moreover, 
the phase of the spectrum may be subjected to a circular shift. Snapshots are taken at update intervals which, in 
principle, may be as short as one sample. These update intervals can be totally pitch-independent, but, for the sake 
of efficient processing, they are preferably dynamically adapted to the pitch period. 

The WI process can be illustratively described as follows. Let S(t,K) be a DFT of a snapshot at time t, with a time-varying pitch period P(t). The inverse DFT (IDFT) of S(t,K), denoted by U(t,c), is taken with respect to a constant DFT basis function support of size T seconds. This is known as time scale normalization, familiar to those skilled in the art. With this normalization, U(t,c) may be viewed as a periodic function, with a period T, along the axis c. When two consecutive snapshots are taken at t_0 and t_T, S(t_T,K) is advantageously aligned to S(t_0,K) by a circular shift for maximum correlation. Therefore, if the pitch signal is slowly varying, the two-dimensional surface U(t,c) is smooth along the t axis. This situation is illustratively depicted in Figure 1, where all the waveforms have the same period T along the c axis and are slowly varying along the t axis. In reality, the surface U(t,c) is not given at any particular point but rather at boundary waveforms U(t_0,c) and U(t_T,c) corresponding to the spectra S(t_0,K) and S(t_T,K). Values in between are advantageously interpolated from these spectra as described below. The variable c in U(t,c) represents the number of normalized pitch cycles. For a speech signal, it is a function of time, denoted c(t), and given by

$$ c(t) = c(t_0) + \int_{t_0}^{t} \frac{T}{P(\tau)}\, d\tau \qquad (1) $$

The one-dimensional signal s(t) is then generated by sampling the surface at the points (t, c(t)):

$$ s(t) = U(t, c(t)) \qquad (2) $$



It is not necessary to generate (i.e., interpolate) the entire surface prior to sampling; the samples along the path (t, c(t)) are advantageously determined by computing the IDFT only at the sampling instants:

$$ s(t) = \sum_{K} S(t,K)\, e^{\,j 2\pi K c(t)/T} \qquad (3) $$

where the spectrum S(t,K) is interpolated from the two boundary spectra:

$$ S(t,K) = \alpha(t)\, S(t_0,K) + \beta(t)\, S(t_T,K), \qquad t_0 \le t \le t_T \qquad (4) $$

The functions α(t) and β(t) may, for example, represent linear interpolation, but other interpolation rules may alternatively be employed, such as, in particular, one that interpolates the spectral magnitude and ... The cycle function c(t) is also advantageously obtained by interpolation. First, the pitch function P(t) is interpolated from its boundary values P(t_0) and P(t_T) and then, equation (1) above is computed for t_0 < t < t_T.
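As a rough illustration of equations (1) through (4), the following Python sketch synthesizes one update interval directly, which is the costly, conventional route that the low-complexity embodiments below are meant to avoid. It assumes discrete time measured in samples, linear pitch interpolation, linear weights α(t) = 1 - β(t), and a one-sided harmonic spectrum (K = 0, 1, ...) whose real part is taken; the helper names `cycle_function` and `wi_synthesize` are hypothetical, not taken from the source.

```python
import cmath
import math


def cycle_function(p0, p1, num_samples, T, c0=0.0):
    """Discretized eq. (1): c advances by T / P per output sample.

    P is linearly interpolated from p0 to p1 (pitch in samples); the
    running value is kept modulo T, the period along the c axis.
    Returns the per-sample c values and the value carried into the
    next update.
    """
    c = c0
    values = []
    for i in range(num_samples):
        frac = i / num_samples
        pitch = (1.0 - frac) * p0 + frac * p1
        values.append(c)
        c = (c + T / pitch) % T
    return values, c


def wi_synthesize(spec0, spec1, p0, p1, num_samples, T, c0=0.0):
    """Direct WI synthesis of one update via eqs. (2)-(4).

    spec0 and spec1 are one-sided harmonic spectra at the boundaries
    t_0 and t_T; alpha and beta are linear interpolation weights.
    """
    cs, c_next = cycle_function(p0, p1, num_samples, T, c0)
    out = []
    for i, c in enumerate(cs):
        beta = i / num_samples
        alpha = 1.0 - beta
        acc = 0j
        for k, (a, b) in enumerate(zip(spec0, spec1)):
            s_tk = alpha * a + beta * b                        # eq. (4)
            acc += s_tk * cmath.exp(2j * math.pi * k * c / T)  # eq. (3)
        out.append(acc.real)
    return out, c_next
```

Every output sample needs a full sum over K at an irregular phase c(t), which is exactly the per-sample cost discussed in Section D. For instance, `wi_synthesize([0, 1], [0, 1], 8, 8, 8, 8)` returns one period of cos(2πn/8).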

Assuming faithful transmission of the update spectra, the signal s(t) has most of the important characteristics of the original speech. In particular, its pitch track follows the original one even though no pitch synchrony has been used and the update times may have been pitch independent. This implies a great deal of information reduction, which is advantageous for low rate coding.

In non-periodic (unvoiced) speech segments, the pitch may be set to whatever essentially arbitrary value is com- 
puted by the encoder's pitch detector and does not, therefore, represent a real pitch cycle. Moreover, the resultant 
pitch value may be advantageously modified in order to smooth the pitch track. Such a pitch may be used by the system 
in the same way, regardless of its true nature. This approach advantageously eliminates voicing classification and 
provides for robust processing. Note that even in this case (in fact, for any signal), the interpolation framework described 
above works well whenever the update interval is less than half the pitch period. 
B. Overview of Signal Decomposition in a WI Coder

A WI encoder typically analyzes and decomposes the speech signal for efficient compression. In particular, the signal decomposition is advantageously performed on two levels. On the first level, standard 10th-order LPC analysis may be performed once per frame over frames of, for example, 25 msec to obtain spectral envelope (LPC) parameters and an LP residual signal. Splitting the signal in this manner allows for perceptually efficient quantization of the spectrum. While a fairly accurate coding of the spectral envelope is preferable for producing high quality reconstructed speech, significant distortions of the fine-structured LP residual spectrum can often be tolerated, especially at higher frequencies. In view of this, the residual signal advantageously undergoes a 2nd-level decomposition, the purpose of which is to split the signal into structured and unstructured components. The structured signal is essentially periodic whereas the unstructured one is non-periodic and essentially random (i.e., noise-like).

Although many advanced low-rate speech coders use this sort of basic decomposition, differing in methods and mechanics, in most WI coders the 2nd-level decomposition is performed using the notions of slowly evolving waveforms (SEW) and rapidly evolving waveforms (REW). (See, e.g., W.B. Kleijn et al., "A speech coder based on decomposition of characteristic waveforms," and U.S. Patent No. 5,517,595, each referenced above.) This approach is based on the observation that in voiced (i.e., mostly periodic) speech segments, acoustic features like pitch and spectral parameters evolve rather slowly, whereas these features evolve much faster in unvoiced segments. Therefore, it may be assumed that if the signal is split into SEW and REW components, the SEW mostly represents a periodic component whereas the REW mostly represents an aperiodic noise-like signal. This decomposition may be advantageously performed in the LP residual domain. For this purpose, the update snapshots of the residual may be obtained by taking pitch-size DFTs at times t_n, thereby yielding the spectra R(t_n,K). The speech spectra are, therefore, given by

$$ S(t_n,K) = A(t_n,K)\, R(t_n,K) \qquad (5) $$

where A(t_n,K) is the LPC spectrum at time t_n.

The SEW sequence may be obtained by filtering each spectral component (i.e., for each value of K) of R(t_n,K) along the temporal axis using, for example, a 20 Hz, 20-tap lowpass filter. This results in a sequence of SEW spectra, SEW(t_n,K), which may then be advantageously down-sampled to, for example, one SEW spectrum per frame. By using a complementary highpass filter, the sequence of REW spectra, REW(t_n,K), may be similarly obtained. Since the spectral snapshots are usually not taken at exact pitch-cycle intervals, the spectra S(t_n) are advantageously aligned prior to filtering. This alignment may, for example, comprise high-resolution phase adjustment, equivalent to a time-domain circular shift, which advantageously maximizes the correlation between the current and previous spectra. This eliminates artificial spectral variations due to phase mismatches.

An interesting observation is that, unlike many other decomposition methods, this decomposition is (at least in principle) lossless and reversible; namely, the original (aligned) sequence R(t_n,K) can be recovered. Thus, this method does not force a ceiling on the coding performance. If the SEW and the REW are coded at sufficiently high bit rates, very high quality speech can be reconstructed by a conventional WI decoder (since the entire residual signal can be accurately reconstructed).
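A minimal Python sketch of the SEW/REW split and its reversibility. A short centred moving average over neighbouring snapshots stands in for the 20 Hz, 20-tap lowpass filter (an assumption made for brevity), and the REW is formed as the complementary residue R - SEW, so adding the two recovers the aligned sequence exactly. Alignment and SEW down-sampling are omitted, and the function name `sew_rew` is hypothetical.

```python
def sew_rew(spectra, taps=5):
    """Split a time sequence of aligned spectra R(t_n, K) into SEW and REW.

    spectra[n][k] holds R(t_n, K=k). Each K is smoothed along the time
    axis with a centred moving average (stand-in lowpass); the REW is the
    complementary highpass output, so SEW + REW == R by construction.
    """
    half = taps // 2
    n_snap = len(spectra)
    n_bins = len(spectra[0])
    sew, rew = [], []
    for n in range(n_snap):
        lo = max(0, n - half)            # shrink the window at the edges
        hi = min(n_snap, n + half + 1)
        row = [sum(spectra[m][k] for m in range(lo, hi)) / (hi - lo)
               for k in range(n_bins)]
        sew.append(row)
        rew.append([spectra[n][k] - row[k] for k in range(n_bins)])
    return sew, rew
```

A slowly varying (here constant) spectral sequence lands entirely in the SEW, while a rapidly alternating one is attenuated in the SEW and pushed into the REW.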

The spectra R(t_n,K) are advantageously normalized to have a unit average root-mean-squared (RMS) value across the K axis. This removes level fluctuations, enhances the SEW/REW analysis and makes it easier to quantize the REW and the SEW. The RMS level (i.e., the gain) may be quantized separately. This also allows the system to take special care of perceptually important changes in signal levels (e.g., onsets), independently of other parameters.
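The normalization step can be sketched in a few lines of Python: the spectrum is divided by its RMS value across K, and that RMS value is returned as the gain to be quantized separately. The function name `normalize_rms` and the handling of an all-zero spectrum are illustrative assumptions.

```python
import math


def normalize_rms(spectrum):
    """Scale a spectrum to unit RMS across K; return (normalized, gain)."""
    gain = math.sqrt(sum(abs(x) ** 2 for x in spectrum) / len(spectrum))
    if gain == 0.0:
        return list(spectrum), 0.0       # silent snapshot: nothing to scale
    return [x / gain for x in spectrum], gain
```

The decoder undoes the step by multiplying the decoded unit-RMS spectrum by the dequantized gain.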

C. A Conventional Waveform Interpolation Coder

Figure 2 shows a block diagram for a conventional WI coder comprising encoder 21 and decoder 22. At the encoder, LP analysis (block 212) is applied to the input speech and the LP filter is used to get the LP residual (block 211). Pitch estimator 214 is applied to the residual to get the current pitch period. Pitch-size snapshots (block 213) are taken on the residual, transformed by a DFT and normalized (block 215). The resulting sequence of spectra is first aligned (block 217) and then filtered along the temporal axis to form the SEW (block 218) and the REW (block 219) signals. These are quantized and transmitted along with the pitch, the LP coefficients (generated by block 212) and the spectral gains (generated by block 216).

At the decoder, the coded REW and SEW signals are decoded and combined (block 223) to form the quantized excitation spectrum R(t_n,K). The spectrum is then reshaped by the LPC spectral envelope and re-scaled by the gain to the proper RMS level (block 222), thereby producing the quantized speech spectra S(t_n,K). These spectra are now interpolated (block 224) as described above to form the final reconstructed speech signal.

The WI coder of Figure 2 is capable of delivering high quality speech as long as ample bit resources are made available for coding all the data, especially the REW and the SEW signals. Note that the REW/SEW representation is, in principle, an over-sampled one, since two full-size spectra are represented. This puts an extra burden on the quantizers. At low bit rates, bits are scarce and the REW/SEW representation is typically severely compromised to allow for a meaningful quantization, as further described below. For example, a typical conventional WI coder operating at a rate of 2.4 kbps uses a frame size of 25 msec and is therefore limited to a bit allocation typically consisting of 30 bits for the LPC data, 7 bits for the pitch information, 7 bits for the SEW data, 6 bits for the REW data, and 10 bits for the gain information (60 bits per frame in total). Similarly, a typical conventional WI coder operating at a rate of 1.2 kbps uses a frame size of 37.5 msec and is therefore limited to a bit allocation typically consisting of 25 bits for the LPC data, 7 bits for the pitch information, no bits for the SEW data, 5 bits for the REW data, and 8 bits for the gain information (45 bits per frame in total). (Note that in the 1.2 kbps case, an overall flat LP spectrum is assumed, and the SEW signal is then presumed to be the portion thereof which is complementary to the REW signal portion which has been coded.)
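The two bit allocations above can be checked against their frame budgets (bit-rate times frame duration) with a few lines of Python. The dictionary layout is only an illustration; the 8-bit gain entry for the 1.2 kbps case is the value implied by its 45-bit frame budget.

```python
def bits_per_frame(rate_bps, frame_ms):
    """Bits available per frame at a given bit-rate and frame size."""
    return rate_bps * frame_ms / 1000.0


# Illustrative per-frame allocations for the two conventional coders.
ALLOC_2400 = {"LPC": 30, "pitch": 7, "SEW": 7, "REW": 6, "gain": 10}
ALLOC_1200 = {"LPC": 25, "pitch": 7, "SEW": 0, "REW": 5, "gain": 8}
```

At 2.4 kbps and 25 msec the budget is 60 bits; at 1.2 kbps and 37.5 msec it is 45 bits, and each allocation sums exactly to its budget.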
Interpolative coding as described above is computationally complex. Some early WI coders actually ran much slower than real time. An improved lower-complexity WI coder was proposed by W.B. Kleijn et al. in "A low-complexity waveform interpolation coder," cited above, but much lower complexity coders are needed to provide commercially viable alternatives in a broad range of applications. Specifically, it is desirable that only a small fraction of a processor's computational power be used by the coder, so that other tasks, such as, for example, networking, can be performed uninterruptedly.

Note that in a typical WI coder, the main contributors to the computational load are the signal decomposition and the interpolation processes. Other significant contributors are the pitch tracking, the spectral alignment and the LPC quantization procedures. Memory usage is also an important factor if an inexpensive implementation is to be achieved. Typical prior art WI coders require a large quantity of RAM to hold the REW and the SEW sequences for the temporal filtering and other operations; overall, about 6K words of RAM are needed by a typical conventional WI coder. Moreover, a large quantity of ROM (typically about 11K words) is needed for the LPC quantization.

D. Low-Complexity Waveform Interpolation Using Cubic Splines

The waveform interpolation process as performed in conventional WI coders and as described above is quite complex, partly because for every time instance, the full spectral vector needs to be interpolated and a DFT-type operation (e.g., the computation of equation (3) above) needs to be carried out. The non-regular sampling of the trigonometric functions, implied by equation (3), makes it even more complex since no simple recursive methods are useful for implementing these functions. To address this problem, the waveform interpolation process may be advantageously approximated in accordance with an illustrative embodiment of the present invention by a much simpler method as follows. The spectra S(t_n,K) are first augmented to a fixed radix-2 size by zero-padding. An inverse Fast Fourier Transform (IFFT) is taken once per update to obtain time signals of fixed size T. These signals are then transformed into cubic spline coefficient vectors. (Cubic spline coefficients, more completely described below, are familiar to those skilled in the signal processing arts.) Using these spline coefficients, samples of a continuous-time estimate of the signal can be generated at any desired point, which advantageously allows for a dynamic time-scaling as determined by the function c(t) of equation (1) above.
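The zero-padding step is what performs the time scale normalization: a pitch cycle of P samples is mapped onto a fixed grid of N samples. The Python sketch below assumes an odd P (so there is no Nyquist bin to split between the two halves of the spectrum) and a target size N ≥ P, such as a power of two; a naive O(N²) DFT stands in for the FFT to keep the example self-contained, and the helper names are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def normalize_cycle(cycle, n_fixed):
    """Map one pitch cycle of P samples onto n_fixed >= P samples by
    zero-padding its DFT (time scale normalization). Assumes P is odd."""
    p = len(cycle)
    X = dft(cycle)
    half = p // 2
    padded = [0j] * n_fixed
    for k in range(half + 1):            # DC and positive harmonics
        padded[k] = X[k]
    for k in range(1, half + 1):         # negative harmonics go to the top
        padded[n_fixed - k] = X[p - k]
    scale = n_fixed / p                  # compensate 1/N vs. 1/P scaling
    return [scale * v.real for v in idft(padded)]
```

For example, one period of a cosine sampled at 5 points comes back as the same cosine sampled at 8 points, so every update yields a time signal of the same size regardless of the pitch.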

The use of a spline representation of a signal is a well-known technique for converting signals from discrete-time to continuous-time representations. (See, e.g., M. Unser et al., "B-Spline Signal Processing: Part I - Theory," IEEE Trans. on Sig. Proc., Vol. 41, No. 2, Feb. 1993, pp. 821-833; M. Unser et al., "B-Spline Signal Processing: Part II - Efficient Design," IEEE Trans. on Sig. Proc., Vol. 41, No. 2, Feb. 1993, pp. 834-848; and H. Hou et al., "Cubic Splines for Image Interpolation and Digital Filtering," IEEE Trans. on Acoust., Sp. & Sig. Proc., Vol. ASSP-26, No. 6, Dec. 1978, pp. 508-517.) For band-limited signals, it can be used in place of the far more expensive, infinite-support "sin(x)/x" filtering operation that perfectly reconstructs a continuous signal from its Nyquist-sampled values.

As is familiar to those skilled in the signal processing arts, the k'th order spline representation of a signal s(t) is defined as

$$ s(t) = \sum_{n} q_n\, B_k(t - n) \qquad (6) $$

where q_n are the spline coefficients and B_k(t) is the spline continuous-time basis function, built of piecewise k'th order polynomials. One advantage of using a spline representation may be found in the fact that the basis function has a small finite support; specifically, it is non-zero only over a support of size k+1. This means that the summation of equation (6) actually needs to be performed over k+1 coefficients only, a significant saving in computational load (and memory) as compared to conventional band-limited filtering. The basis support is divided into k+1 sections at the time points t = n, where n = -k+1, ..., k-1, referred to as nodes. The basis is symmetric with B_k(0) = 1 and B_k(|t| ≥ k-1) = 0. Thus, B_k(t) is fully defined by assigning k'th order polynomials to the positive k-1 sections. The (k-1)(k+1) polynomial parameters may be resolved by imposing continuity conditions at the nodes. Specifically, the 0'th to (k-1)'st order derivatives of B_k(t) are advantageously continuous at the nodes.

It is known to those skilled in the art that 3rd order splines (i.e., cubic splines) are sufficient for high-quality interpolation of most signals with a very low computational load. Therefore, in accordance with one illustrative embodiment of the present invention, cubic splines are used in performing waveform interpolation in a low-complexity WI decoder.

EP 0 865 028 A1

Applying the definition above to B_3(t) (i.e., the cubic spline basis), it will be obvious to those skilled in the art that equation (6) can be put into a matrix form as follows:

$$ s(t) = \begin{bmatrix} (t-n)^3 & (t-n)^2 & t-n & 1 \end{bmatrix} \begin{bmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 0 & 3 & 0 \\ 1 & 4 & 1 & 0 \end{bmatrix} \begin{bmatrix} q_{n-1} \\ q_n \\ q_{n+1} \\ q_{n+2} \end{bmatrix} \qquad (7) $$

where n ≤ t ≤ n + 1. Let s(n) be a discrete-time sampled sequence of size N whose underlying continuous signal s(t) it is desired to estimate. It follows then from equation (7) above that for t = n,



$$ s(n) = q_{n-1} + 4\, q_n + q_{n+1} \qquad (8) $$



This defines the transform from the signal to the spline coefficients in the form of an IIR (infinite-impulse-response) filtering operation, familiar to those of ordinary skill in the art. This filter is non-causal and, therefore, care should be taken to implement it in a stable fashion. Also, a proper set of two initial conditions should be selected. As is familiar to those of ordinary skill in the art, one stable approach is to split the filtering into forward (causal) and backward (non-causal) operations. Equation (8) can be easily broken into two first order recursions using an auxiliary sequence f_n and the stable pole of equation (8), namely, p = -2 + √3 (approximately -0.268), as follows:



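$$ \begin{aligned} f_n &= s(n) + p\, f_{n-1}, & n &= 0, \ldots, N-1 \\ q_n &= p\, (q_{n+1} - f_n), & n &= N-1, \ldots, 0 \end{aligned} \qquad (9) $$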



For a complete definition of this transformation, the initial values f_{-1} and q_N should be known. As such, in accordance with one illustrative embodiment of the present invention, f_{-1} = q_N = 0. Note that in accordance with the present invention, essentially any method for assigning these initial values may be used, but different methods yield different values for s(t), especially near the boundaries. Nonetheless, all of the resulting variants of s(t) advantageously yield the same sequence when sampled at t = n.
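The two recursions of equation (9), with the zero initial conditions f_{-1} = q_N = 0, can be sketched in Python as follows. The constant `POLE` is p = √3 - 2 and the function name is hypothetical. With these initial conditions the resulting q_n satisfy equation (8) exactly for n = 1, ..., N-1 (using q_N = 0); at n = 0 the zero start implicitly corresponds to q_{-1} = p q_0, which illustrates how the boundary handling affects s(t) near the edges.

```python
import math

POLE = math.sqrt(3.0) - 2.0   # stable pole p of equation (8), about -0.268


def spline_coefficients_recursive(s):
    """Cubic spline coefficients via the causal/anti-causal pair of eq. (9)."""
    n = len(s)
    f = [0.0] * n
    prev = 0.0                        # f_{-1} = 0
    for i in range(n):                # forward (causal) pass
        prev = s[i] + POLE * prev
        f[i] = prev
    q = [0.0] * n
    nxt = 0.0                         # q_N = 0
    for i in range(n - 1, -1, -1):    # backward (anti-causal) pass
        nxt = POLE * (nxt - f[i])
        q[i] = nxt
    return q
```

Each sample costs one multiply and one add per pass, i.e. the 4 operations per input sample mentioned below for the time-domain version.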

In accordance with another illustrative embodiment of the present invention, another method for setting the initial conditions is employed. This method is based on assuming that s(n) is periodic with period N. Obviously, this implies that q_n is also periodic. In this case, if the relation between s(n) and q_n is expressed in the frequency domain by the DFT operation, the initial conditions are determined implicitly and no further care need be taken in this regard. Also, stability is of no concern in this case.

The DFT-domain filter H(K) associated with equation (8) may be obtained by computing the DFT of the sequence

$$ h_n = \begin{cases} 4, & n = 0 \\ 1, & n = 1 \\ 1, & n = N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (10) $$

that is, H(K) = DFT{h_n}. Similarly, S(K) = DFT{s(n)} and Q(K) = DFT{q_n}. Thus, the DFT version of equation (8) is simply S(K) = H(K) Q(K). Defining the spline window as W(K) = 1 / H(K), we get the spline transform:

$$ Q(K) = W(K)\, S(K) \qquad (11) $$
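Under the periodicity assumption, equations (10) and (11) reduce to a pointwise division in the DFT domain. For the symmetric h_n of equation (10), H(K) works out to 4 + 2cos(2πK/N), which is real and never smaller than 2, so W(K) is always well defined. The Python sketch below uses a naive O(N²) DFT only to stay self-contained; a decoder would instead apply W(K) to the zero-padded spectrum it already holds and use a radix-2 IFFT. Function names are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def spline_window(n):
    """W(K) = 1 / H(K), where H(K) = DFT{h_n} = 4 + 2 cos(2 pi K / N)."""
    return [1.0 / (4.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) for k in range(n)]


def spline_coefficients_periodic(s):
    """Periodic spline transform of equation (11): q = IDFT{W(K) S(K)}."""
    W = spline_window(len(s))
    Q = [w * S for w, S in zip(W, dft(s))]
    return [v.real for v in idft(Q)]   # imaginary parts are rounding noise
```

Because W(K) depends only on N, it can be tabulated once, which is the ROM-resident window mentioned below.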

Note that the complex window W(K) may be advantageously computed once off line and kept in ROM. Note also that the complexity of the transform is merely 3 operations per input sample, and that it is actually less than that of the time-domain counterpart as in equation (9), which requires 4 operations per input sample. However, to get the time-domain spline coefficients, an IDFT should be applied to Q(K). The data processed by the WI decoder is already given in the DFT domain; this is the signal S(t_0,K). Therefore, using W(K) for the spline transform is convenient. And the time-scale normalization required for the WI process may be conveniently performed by simply appending zeros to S(t_0,K) along the K axis. Moreover, the DFT may be advantageously augmented to a fixed radix-2 size N so that a fixed-size IFFT can be advantageously employed. The result of this IDFT is the spline coefficient sequence q_n of size N.

In accordance with one illustrative embodiment of the present invention, the final synthesis of the reconstructed speech signal may now be performed as follows. The cycle function c(t) is used to locate the sampling instants t in terms of fractions of the normalized cycle T = N. The four relevant spline coefficients implied by equation (7) are identified. These coefficients are interpolated with the corresponding coefficients from the spline vector of the previous update, i.e., the one obtained from S(t_{-T},K). Finally, using equation (7), the value s(t) is obtained. This process is advantageously repeated for enough values of t so as to fill the output signal update buffer. Note that c(t) preserves continuity across updates; namely, it increments from its last value from the previous update. However, this is performed modulo T, which is in line with the basic periodicity assumption.
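The synthesis steps just described can be sketched in Python as follows. This is a minimal illustration, assuming the unnormalized matrix of equation (7) (so that s(n) = q_{n-1} + 4q_n + q_{n+1}), linear interpolation of both the pitch and the four active coefficients across the update, and a periodic coefficient buffer of size N; the names `eval_segment` and `synthesize_update` are hypothetical.

```python
import math

# Basis matrix of equation (7): rows multiply [u^3, u^2, u, 1], columns
# multiply [q_{n-1}, q_n, q_{n+1}, q_{n+2}], with u = t - n.
M_CUBIC = [
    [-1.0, 3.0, -3.0, 1.0],
    [3.0, -6.0, 3.0, 0.0],
    [-3.0, 0.0, 3.0, 0.0],
    [1.0, 4.0, 1.0, 0.0],
]


def eval_segment(coeffs4, u, m=M_CUBIC):
    """Evaluate [u^3, u^2, u, 1] * m * coeffs4 for 0 <= u < 1."""
    powers = [u ** 3, u ** 2, u, 1.0]
    return sum(powers[r] * m[r][col] * coeffs4[col]
               for r in range(4) for col in range(4))


def synthesize_update(q_prev, q_curr, p_prev, p_curr, num_samples, c0=0.0):
    """Fill one output buffer from two spline coefficient vectors of size N.

    c is measured in units of the normalized cycle T = N and kept modulo N.
    Returns the output samples and the final c for the next update.
    """
    n_len = len(q_curr)
    out, c = [], c0
    for i in range(num_samples):
        beta = i / num_samples                  # weight of the current update
        n = math.floor(c)
        u = c - n
        idx = [(n + j) % n_len for j in (-1, 0, 1, 2)]
        local = [(1.0 - beta) * q_prev[k] + beta * q_curr[k] for k in idx]
        out.append(eval_segment(local, u))
        pitch = (1.0 - beta) * p_prev + beta * p_curr
        c = (c + n_len / pitch) % n_len         # discretized eq. (1), modulo T
    return out, c
```

Only four coefficients are touched per output sample, and the fractional position u carries all of the dynamic time scaling, so no trigonometric functions are evaluated at synthesis time. With identical coefficient vectors and a pitch equal to N, the output simply cycles through the samples s(n) implied by equation (8).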

A block diagram of a first illustrative waveform interpolation process for use in a low-complexity WI coder in
accordance with the present invention is shown in Figure 3. In particular, the illustrative WI process shown in Figure 3
carries out waveform interpolation with use of cubic splines in accordance with the above description thereof.
Specifically, block 31 pads the input spectrum with zeros to ensure a fixed radix-2 size. Then, block 32 takes the spline
transform as described above, and block 33 performs the IFFT on the resultant data. Block 34 is used to store each
resultant set of data so that the interpolation of the spline coefficients may be performed (by block 38) based upon the
current and previous waveforms. Block 36 operates on the current input pitch value and the previous input pitch value
(as stored by block 35) to perform the dynamic time scaling, and based thereupon, block 37 determines the spline
coefficients to be interpolated by block 38. Finally, block 39 performs the cubic spline interpolation to produce the
resultant output speech waveform (in the time domain).
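The front end of Figure 3 (blocks 31 to 33) might look like the sketch below. It assumes that the spline transform of block 32 is the standard cubic B-spline prefilter applied in the frequency domain, in which each bin is divided by the B-spline frequency response (4 + 2 cos w)/6. It also assumes that the radix-2 size is obtained by rounding the pitch up to a power of two, as described in Section G. A textbook recursive IFFT is included only to keep the example self-contained. The function names are hypothetical.

```python
import cmath
import math


def next_pow2(p):
    """Smallest power of two >= p, i.e. the pitch rounded up to a radix-2 size."""
    return 1 << (int(p) - 1).bit_length()


def ifft(x):
    """Unnormalized recursive radix-2 inverse FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = ifft(x[0::2]), ifft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def harmonics_to_time(harmonics, n_fft, spline_window=True):
    """Blocks 31-33 (sketch): zero-pad, optional spline window, IFFT.

    harmonics: one-sided complex spectrum X[0..K] of a real pitch cycle, K <= n_fft/2.
    With spline_window=True the result is the cubic B-spline coefficient
    sequence q; otherwise it is the plain time-domain cycle.
    """
    if len(harmonics) - 1 > n_fft // 2:
        raise ValueError("spectrum does not fit in n_fft / 2 + 1 bins")
    full = [0j] * n_fft                          # zero padding (block 31)
    for k, value in enumerate(harmonics):
        if spline_window:                        # spline transform (block 32)
            value = value * 3.0 / (2.0 + math.cos(2 * math.pi * k / n_fft))
        full[k] = complex(value)
        if 0 < k and n_fft - k != k:             # Hermitian symmetry -> real output
            full[n_fft - k] = complex(value).conjugate()
    return [v.real / n_fft for v in ifft(full)]  # IFFT (block 33)
```

Evaluating the resulting spline at the nodes, (q[n-1] + 4q[n] + q[n+1])/6, reproduces the plain IFFT output. That is the property the spline transform is meant to guarantee.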


E. Low-Complexity Waveform Interpolation Using Pseudo Cardinal Splines

In accordance with another illustrative embodiment of the present invention, a variant of the above-described
method further reduces the required computations by eliminating the use of the spline transform (i.e., the spline
window). It is based on the notion of cardinal splines, familiar to those skilled in the signal processing arts and described,
for example, in M. Unser et al., "B-Spline Signal Processing: Part I - Theory," cited above. The cardinal spline
representation is obtained by imposing one additional condition on the basis function, namely, that it is strictly zero at
the nodes: B(t) = 0 for integer t = n with n ≠ 0. As a result, it can no longer have a local finite support. Note, however,
that its tails decay quickly, similar to those of the sin(x)/x function discussed above. The pseudo cardinal splines used
here in accordance with an illustrative embodiment of the present invention are based on using a finite-support basis
function that satisfies this additional condition with a relaxation of the other (i.e., the continuity) conditions. As in the
above-described case using cubic splines, a 3rd-order symmetric basis function over a support of -2 ≤ t ≤ 2 is used. One
additional condition is imposed, however, namely,

$$B(1) = B(-1) = 0 \qquad (12)$$



Therefore, only one continuity condition has to be given up: the second derivative is permitted to have an arbitrary
value at the nodes t = -2 and t = 2. Note that the basis function and its first derivative are zero at these points. Deriving
the basis function under these conditions and expressing the interpolation operation in matrix form gives:



$$
s(t) = \left[(t-n)^3,\ (t-n)^2,\ t-n,\ 1\right]
\begin{bmatrix}
-0.75 & 1.25 & -1.25 & 0.75 \\
1.50 & -2.25 & 1.50 & -0.75 \\
-0.75 & 0.00 & 0.75 & 0.00 \\
0.00 & 1.00 & 0.00 & 0.00
\end{bmatrix}
\begin{bmatrix} q_{n-1} \\ q_{n} \\ q_{n+1} \\ q_{n+2} \end{bmatrix}
\qquad (13)
$$


where n ≤ t ≤ n + 1, which is the same as equation (7) except for the numerical values of the matrix. Setting t - n = 0
(note the bottom row of the matrix) gives the relation between the input samples and the spline coefficients, which is
simply

$$s(n) = q_n \qquad (14)$$

That is, the input samples are the spline coefficients and, therefore, no further transformation is required. The
complexity of the interpolator is as in the above-described embodiment, except that filtering and windowing are
advantageously avoided. This saves three operations per sample, thereby reducing the decoder complexity even further.
Also, note that no additional RAM is needed to store the current and previous spline coefficients and no additional
ROM is needed to hold the spline window.
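Below is a direct transcription of equations (13) and (14) as a sketch; the function names are hypothetical. Because the samples serve as the coefficients, the spline window and its ROM are no longer needed. The piecewise basis function is derived here from the matrix of equation (13). Its coefficients happen to coincide with those of the well-known cubic convolution interpolation kernel with parameter a = -0.75.

```python
import math

# Matrix of equation (13).
M_PSEUDO_CARDINAL = [
    [-0.75, 1.25, -1.25, 0.75],
    [1.50, -2.25, 1.50, -0.75],
    [-0.75, 0.00, 0.75, 0.00],
    [0.00, 1.00, 0.00, 0.00],
]


def pseudo_cardinal_eval(samples, pos):
    """Interpolate a periodic sample sequence at fractional position pos.

    By equation (14) the samples themselves are the spline coefficients, so no
    spline transform (window) is applied beforehand.
    """
    size = len(samples)
    n = math.floor(pos)
    tau = pos - n
    c = [samples[(n + k) % size] for k in (-1, 0, 1, 2)]
    powers = [tau ** 3, tau ** 2, tau, 1.0]
    return sum(powers[i] * sum(M_PSEUDO_CARDINAL[i][j] * c[j] for j in range(4))
               for i in range(4))


def basis(t):
    """Basis function B(t) implied by equation (13); zero outside -2 <= t <= 2."""
    x = abs(t)
    if x <= 1:
        return 1.25 * x ** 3 - 2.25 * x ** 2 + 1
    if x < 2:
        return -0.75 * x ** 3 + 3.75 * x ** 2 - 6 * x + 3
    return 0.0
```

The basis satisfies equation (12): basis(1) and basis(-1) are zero. Interpolating with the matrix gives the same result as summing shifted copies of basis(t) weighted by the samples.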

Note that the performance (i.e., in terms of the quality of the reconstructed speech signal) of an approach based
on pseudo cardinal splines is likely not to be as good as that of one based on regular cubic splines, since pseudo
cardinal splines are merely an approximation to the real cardinal splines. However, the level of distortion added to the
data in the modeling and quantization process is typically far above the noise likely to be added by the use of a pseudo
cardinal spline-based interpolator. Thus, the advantages of the reduced complexity outweigh the disadvantages of using
such an approximation.

A block diagram of a second illustrative waveform interpolation process for use in a low-complexity WI coder in
accordance with the present invention is shown in Figure 4. In particular, the WI process shown in Figure 4 carries out
waveform interpolation with use of pseudo cardinal splines in accordance with the above description thereof.
Specifically, the operation of the illustrative waveform interpolation process shown in Figure 4 is similar to that of the
illustrative waveform interpolation process shown in Figure 3, except that the spline transformation (block 32) has
become unnecessary and has therefore been removed, and the cubic spline interpolation (block 39) has been replaced
by a pseudo cardinal spline interpolation (block 49).



F. Low-Complexity Signal Decomposition

As noted above, the SEW/REW analysis requires parallel filtering of the spectra R(t_n, K) for all the harmonic indices
K. In conventional WI coders, this is typically performed with use of 20-tap filters, which are a major contributor to the
overall complexity of prior art WI coders. This process generates two sequences of spectra that need to be coded and
transmitted: the SEW sequence and the REW sequence. While the SEW sequence can be downsampled prior to
quantization, the REW needs to be quantized at full time and frequency resolution. However, at 2.4 kbps and lower
coding rates, the typical bit budget (see above) is too small to produce a useful representation of the data. As an
example of this problem, consider a pitch period of 80 samples and an update interval of approximately 12 msec. For a
typical frame size of 25 msec, there are approximately 2 updates in each frame. Typically, only the magnitude DFT is
quantized, so there are (80 / 2) x 2 = 80 REW values in a frame to quantize. However, the bit budget allows for only
6 bits per frame (i.e., 3 bits per spectrum) for the REW quantizer, that is, 0.075 bits per component. Obviously, only a
very rough approximation to the REW magnitude spectrum can possibly be transmitted in this case. Indeed, in the WI
coder described in W.B. Kleijn et al., "A low-complexity waveform interpolation coder," cited above, the REW signal is
drastically smoothed and parameterized into only 5 parameters using a polynomial curve fitting technique.
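The bit-budget arithmetic of this example can be restated as a short calculation. All figures are taken from the text above.

```python
pitch = 80              # pitch period in samples
updates_per_frame = 2   # ~12 msec updates in a ~25 msec frame
rew_bits_per_frame = 6  # 3 bits per spectrum

magnitudes_per_spectrum = pitch // 2            # only the magnitude DFT is quantized
rew_values_per_frame = magnitudes_per_spectrum * updates_per_frame
bits_per_component = rew_bits_per_frame / rew_values_per_frame
print(rew_values_per_frame, bits_per_component)  # 80 0.075
```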

A similar situation exists for the SEW signal. Only 7 bits per frame are available according to the typical bit budget
(see above). Therefore, only the SEW baseband spectrum of about 800 Hz is typically coded. The higher band is
typically estimated assuming an overall flat LP spectrum, that is,

$$SEW(t, K) + REW(t, K) = 1 \qquad (15)$$

This assumption regarding the flatness of the LP spectrum has been widely used in low-rate speech coding and,
particularly, in WI-based coders. It is a reasonable assumption to make in the absence of bit resources; however, it
is a gross under-representation of the LP spectrum, especially when the spectrum is taken over short frames, as in
the typical WI coder case. The SEW signal and the REW signal are therefore severely distorted in the quantization
process, and little of the character of the original signal is left after coding.

Having recognized the existence of a substantial mismatch between the analysis (e.g., the decomposition) of the
original residual signal and the quantization resolutions actually used in typical WI coding environments, one
illustrative embodiment of the present invention provides a much simpler analysis than that performed by prior art WI
coders. In particular, it is recognized that it is unnecessary to perform a very expensive analysis at a very high resolution
only to lose most of the information at the quantization stage. Since the performance of the coder is essentially
dominated by the quantizer, a much simpler analysis can in theory be used. Thus, in accordance with an illustrative
embodiment of the present invention, a new approach is taken to the task of signal decomposition and coding, changing
the way the SEW and the REW are defined and processed.


1. Low-complexity signal decomposition of the unstructured component 

In accordance with one illustrative embodiment of the present invention, the unstructured component of the residual
signal is exposed by merely taking the difference between the properly aligned normalized current and previous spectra.
This is essentially equivalent to simplifying the REW signal generation by replacing the 20th-order filter typically found
in a conventional WI encoder with a first-order filter. In voiced speech, for example, this difference reflects an
unstructured random component. It will be referred to herein simply as the random spectrum (RS). The RSs may be
advantageously smoothed by a low-order (e.g., two or three) orthogonal polynomial expansion (using, e.g., three or four
parameters per spectrum). Examination of typical smoothed SEW signals and typical smoothed RSs shows that both
spectra are almost always monotonically increasing with frequency. In other words, the residual signal is invariably
monotonically less structured in higher frequency bands. Given a bit allocation of only 3 bits to code each RS (see the
discussion of typical bit allocations above), only 8 such smoothed spectra can be used by the RS quantizer.
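A sketch of the RS computation and its low-order smoothing follows. It makes three assumptions. The spectra are already aligned. Each spectrum is normalized to unit energy. The "orthogonal polynomial expansion" is realized as a least-squares projection onto discrete polynomials that are orthonormalized by Gram-Schmidt over the harmonic index. The function names are hypothetical.

```python
import math


def random_spectrum(cur, prev):
    """RS(K) = |S_cur(K) - S_prev(K)| for two aligned spectra, each scaled to unit energy."""
    def normalize(spec):
        energy = math.sqrt(sum(abs(v) ** 2 for v in spec))
        return [v / energy for v in spec]

    a, b = normalize(cur), normalize(prev)
    return [abs(x - y) for x, y in zip(a, b)]


def orthonormal_basis(length, order):
    """Discrete polynomials of degree 0..order on `length` points, orthonormalized
    by Gram-Schmidt over the (rescaled) harmonic index."""
    xs = [i / max(length - 1, 1) for i in range(length)]
    basis = []
    for degree in range(order + 1):
        v = [x ** degree for x in xs]
        for u in basis:
            dot = sum(p * q for p, q in zip(v, u))
            v = [p - dot * q for p, q in zip(v, u)]
        norm = math.sqrt(sum(p * p for p in v))
        basis.append([p / norm for p in v])
    return basis


def smooth(spectrum, order=2):
    """Least-squares projection onto the low-order basis.

    Returns the order + 1 expansion parameters and the smoothed curve."""
    basis = orthonormal_basis(len(spectrum), order)
    params = [sum(s * u for s, u in zip(spectrum, b)) for b in basis]
    curve = [sum(p * b[i] for p, b in zip(params, basis))
             for i in range(len(spectrum))]
    return params, curve
```

With order=2, each spectrum is described by three parameters, which matches the lower end of the range given above.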

By training a 3-bit vector quantizer (VQ) in a conventional manner over a long sequence of smoothed RSs, a set
of 8 codebook spectra can be generated. One such illustrative set of codebook spectra is shown in Figure 5. In
accordance with the illustrative embodiment of the present invention, smoothing and quantization can be combined during
the coding process (as described, for example, in W.B. Kleijn et al., "A low-complexity waveform interpolation coder,"
cited above) by doing three full-size inner products per vector. However, note that the constellation of the illustrative
set of codebook spectra provides for an additional level of simplification. Specifically, since the curves shown in Figure
5 are monotonically increasing with their indices, they can be pointed to uniquely based upon the areas under them,
which is equivalent to their energies. Heuristically, this implies that a scalar parameter can be computed from the input
data which can point to an entry in the RS codebook. In other words, a codebook entry (e.g., an illustrative curve from
Figure 5) represents a smoothed version of the magnitude difference of two aligned normalized spectra,
$$RS(K) = \left|S_1(K) - S_2(K)\right| \qquad (16)$$

consistent with the RS definition. The corresponding energy is

$$\sum_K \left|S_1(K) - S_2(K)\right|^2 = 2 - 2\sum_K S_1(K)\,S_2^{*}(K) \qquad (17)$$



where the last term can be identified as the square of the cross-correlation between the corresponding time-domain
signals. These signals are the two properly aligned successive snapshots of the input signal (i.e., the LP residual). If
the update interval is approximately one pitch period in size, this cross-correlation is related to the pitch-lag correlation
C(P) of the input, where P is the pitch period and C(·) is the standard correlation function. Therefore (ignoring the factor
2), the parameter u = 1 - (C(P))^2 is essentially used as an initial "soft index" to the codebook. Using a quantization
table, u is advantageously mapped into an index in the range [0, 7], which points to an RS curve (i.e., a codebook entry).

The above approach has four major advantages from the perspective of encoder complexity. First, no explicit
high-resolution RS needs to be generated. Second, no alignment is needed. Third, no filtering is required. And fourth, no
curve fitting is required. Note, however, that in accordance with this illustrative embodiment of the present invention,
the pitch-lag correlation is found at the current update rate.
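The scalar soft index can be sketched as follows. First, compute the normalized pitch-lag correlation C(P) on the residual. Then form u = 1 - C(P)^2 and map u to an RS codebook index in [0, 7]. The threshold table U_THRESHOLDS is purely hypothetical, because the text does not give the quantization table. The function names and the choice of analysis window are also assumptions.

```python
import math

# Hypothetical quantization table: 7 thresholds split u into 8 cells (indices 0..7).
U_THRESHOLDS = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.8]


def pitch_lag_correlation(x, pitch, start, length):
    """Normalized correlation C(P) between x[n] and x[n - P] over x[start:start+length]."""
    if start < pitch:
        raise ValueError("need at least one pitch period of history")
    seg = x[start:start + length]
    lagged = x[start - pitch:start - pitch + length]
    num = sum(a * b for a, b in zip(seg, lagged))
    den = math.sqrt(sum(a * a for a in seg) * sum(b * b for b in lagged))
    return num / den if den > 0 else 0.0


def rs_soft_index(x, pitch, start, length):
    """u = 1 - C(P)^2 and its index into the 8-entry RS codebook."""
    c = pitch_lag_correlation(x, pitch, start, length)
    u = 1.0 - c * c
    index = sum(1 for threshold in U_THRESHOLDS if u >= threshold)
    return u, index
```

A strictly periodic input gives u close to 0 (index 0). An input that is uncorrelated with itself one pitch period earlier gives u close to 1 (index 7).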

The parameter u as defined above reflects the level of "unvoicing" in the signal. Its temporal dynamics are predictable
to a certain degree, since it is consistently high in unvoiced regions and low in voiced ones. This can be efficiently
exploited by applying VQ to consecutive values of this parameter. Thus, in accordance with another illustrative
embodiment of the present invention, instead of directly quantizing the RS using 3 bits per vector, a 6-bit VQ may be
advantageously used to quantize and transmit a u-vector within a frame. At the receiver, the decoded u-values may be
mapped into a set of orthogonal polynomial parameters, and a smoothed RS spectrum may be generated therefrom.

Note that the decoded RS represents a magnitude spectrum. The complete complex RS may, in accordance with
an illustrative embodiment of the present invention, be obtained by adding a random phase spectrum, which is
consistent with the presumption of an unstructured signal. The random phase may be obtained inexpensively by, for
example, a random sampling of a phase table. Such an illustrative table holds 128 two-dimensional vectors of radius 1.
An index to this table, I, where 0 ≤ I < 128, may, for example, be generated pseudo-randomly by the C-language index
recursion



I = (seed = (seed * 17) % 4096) >> 5     (18)

which can be advantageously implemented by fast bitwise operations. 
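The random-phase step might be sketched as shown below. The 128-entry table of unit vectors is built here from uniformly spaced angles. The index generator is a stand-in: a 12-bit linear congruential recursion with multiplier 5 and increment 17. These constants are chosen because they give a full period of 4096 and need only a multiply-add, a mask and a shift. Neither the table contents nor these constants are taken from the text or from equation (18); both are assumptions, and the names are hypothetical.

```python
import cmath
import math

# Hypothetical phase table: 128 two-dimensional unit vectors, stored as complex numbers.
PHASE_TABLE = [cmath.exp(2j * math.pi * i / 128) for i in range(128)]


class PhaseIndexGenerator:
    """Cheap pseudo-random table indices from a 12-bit linear congruential state.

    Multiplier 5, increment 17 and modulus 4096 are assumptions (they give a full
    period of 4096); they are not a transcription of equation (18).
    """

    def __init__(self, seed=1):
        self.seed = seed & 4095

    def next_index(self):
        self.seed = (5 * self.seed + 17) & 4095  # multiply-add, then 12-bit mask
        return self.seed >> 5                    # top 7 bits: 0 <= I < 128


def complex_rs(magnitudes, generator):
    """Attach a random phase from the table to each decoded RS magnitude."""
    return [m * PHASE_TABLE[generator.next_index()] for m in magnitudes]
```

Taking the top 7 bits of the 12-bit state keeps every index in the range 0 to 127. Over one full period, each table entry is drawn exactly 32 times.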

2. Low-complexity signal decomposition of the structured component

In typical WI coders, the SEW signal is obtained by filtering each harmonic component of a sequence of properly
aligned pitch-size spectra along the temporal axis using a 20-tap FIR (finite-impulse-response) lowpass filter. The
filtered sequence is then decimated to one spectrum per frame. This is equivalent to taking a weighted average of
these spectra once per frame. As noted earlier, both filtering and alignment may be advantageously avoided in
accordance with certain illustrative embodiments of the present invention.

In certain illustrative embodiments of the present invention, the structured signal may be advantageously processed
as follows. Given the pitch period P for the current frame, a new frame containing an integral number M of pitch periods
is determined. Typically, the new frame overlaps the nominal frame. The pitch-size average spectrum, referred to herein
as AS, may then be obtained by applying a DFT to this frame, decimating the MP-size spectrum by the factor M, and
normalizing the result. This approach advantageously eliminates the need for spectral alignment. To reduce the DFT
complexity, the SEW frame may first be upsampled to a radix-2 size N > MP, and then a fast Fourier transform (FFT)
may be used. Note that this time scaling does not affect the size of the spectrum, which is still equal to MP. The
upsampling may, for example, be performed using cubic spline interpolation as described above.
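The average-spectrum computation can be sketched as follows. The sketch assumes the frame starts at a given sample index and normalizes the result to unit RMS magnitude; the text does not specify the normalization. For clarity, it evaluates a direct DFT at the harmonic bins instead of using the radix-2 upsampling plus FFT described above. Keeping every M-th bin of the MP-point DFT is the same as summing the M per-cycle DFTs, each taken with its own time origin. This is why no alignment is needed. The function name is hypothetical.

```python
import cmath
import math


def average_spectrum(x, start, pitch, m):
    """Pitch-size average magnitude spectrum (AS) from M whole pitch periods.

    x[start:start + m * pitch] is transformed with an (m*pitch)-point DFT, and
    only every m-th bin (the pitch harmonics) is kept, i.e. decimation by M.
    """
    frame = x[start:start + m * pitch]
    n = len(frame)
    if n != m * pitch:
        raise ValueError("signal too short for M pitch periods")
    harmonics = []
    for k in range(pitch // 2 + 1):          # harmonic k sits at DFT bin k*m
        b = k * m
        harmonics.append(sum(v * cmath.exp(-2j * math.pi * b * i / n)
                             for i, v in enumerate(frame)))
    mags = [abs(h) for h in harmonics]
    rms = math.sqrt(sum(v * v for v in mags) / len(mags))
    return [v / rms for v in mags] if rms > 0 else mags  # assumed normalization
```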

The average spectrum, AS, may be viewed as a simplified version of the SEW obtained with a simple filter. Unlike the
REW and SEW signals generated by the conventional WI coder, AS(K) and (the unsmoothed) RS(K) are not
complementary, since they are not generated by two complementary filters. In fact, AS(K) by itself may be viewed as the
current estimate of the LP magnitude spectrum. Therefore, the part of the spectrum which may be considered the
structured spectrum (SS) is

$$SS(K) = AS(K) - RS(K) \qquad (19)$$

The bit budget of the WI coder as described above provides for only 7 bits for the coding of the AS. Since the lower
frequencies of the LP residual are perceptually more important, only the baseband containing the lower 20% of the
SEW spectrum is advantageously coded in accordance with an illustrative embodiment of the present invention. The
rest of the AS magnitude spectrum may, for example, be presumed to be flat, with AS(K) = 1.

Thus, the illustrative low-complexity coder codes the AS baseband and then transmits the coded result once per
frame. The coding may illustratively be performed using a 7-bit VQ with ten-dimensional codevectors, applied at a
variable dimension D, where D is the lesser of 0.2·P/2 and 10. If D < 10, only the first D terms of the codevectors may
be used. At the receiver, the AS baseband may be interpolated at the synthesis update rate, and the SS(K) spectrum may
be computed therefrom.
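Below is a sketch of the variable-dimension baseband search and of equation (19). Several details are assumptions: 0.2·P/2 is truncated to an integer, squared error is used as the distance measure, and the function names are hypothetical. In practice the codebook would be a trained 128-entry (7-bit) table of ten-dimensional vectors.

```python
def as_dimension(pitch):
    """D = min(0.2 * P / 2, 10), truncated to an integer (an assumption)."""
    return min(int(pitch) // 10, 10)


def baseband_vq(baseband, codebook):
    """Search ten-dimensional codevectors using only their first D terms."""
    d = len(baseband)
    return min(range(len(codebook)),
               key=lambda i: sum((baseband[j] - codebook[i][j]) ** 2 for j in range(d)))


def structured_spectrum(as_baseband, rs):
    """SS(K) = AS(K) - RS(K), with AS(K) = 1 above the coded baseband."""
    as_full = list(as_baseband) + [1.0] * (len(rs) - len(as_baseband))
    return [a - r for a, r in zip(as_full, rs)]
```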

The magnitude spectrum SS(K) represents a periodic signal. Therefore, a fixed phase spectrum may be
advantageously attached thereto so as to provide for some level of phase dispersion, as observed in natural speech. This
maintains periodicity while avoiding buzziness. The phase spectrum, which may be derived from a real speaker,
illustratively has 64 complex values of radius 1. It may be held in the same phase table used by the RS (the first 64
entries), thereby incurring no extra ROM. The resulting complex SS is illustratively combined with the complex RS to form
the final quantized LP spectrum for the current update.

G. Update Rate Considerations 

In conventional WI coding, the SEW and the REW can be generated and processed at any desired update rate,
independently of the current pitch. Moreover, the rates may be different in the encoder and decoder. If a fixed rate is
used (e.g., a 2.5 msec update interval), the data flow control is straightforward. However, since the spectrum size is,
in fact, pitch dependent, so is the resulting computational load. Thus, at a fixed update rate, the complexity increases
with the value of the pitch period. Since the maximum computational load is often of concern, it is advantageous to
"equalize" the complexity. Therefore, in accordance with an illustrative embodiment of the present invention, in order
to reduce the peak load, the update rate advantageously varies proportionally to the pitch frequency.
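A rough calculation shows why a pitch-proportional update rate equalizes the load. It assumes an 8 kHz sampling rate and counts about P/2 harmonic components per spectrum. These figures are illustrative and do not come from the text.

```python
SAMPLE_RATE = 8000  # samples per second (assumed)


def components_per_second(pitch, update_interval):
    """Spectral components processed per second when one pitch-size spectrum
    (about pitch / 2 harmonic components) is handled at every update."""
    updates_per_second = SAMPLE_RATE / update_interval
    return updates_per_second * pitch / 2


PITCHES = (20, 80, 147)  # pitch periods in samples
# Fixed 2.5 msec updates (20 samples at 8 kHz): the load grows with the pitch.
fixed = [components_per_second(p, 20) for p in PITCHES]
# One update per pitch period (rate proportional to pitch frequency): flat load.
adaptive = [components_per_second(p, p) for p in PITCHES]
print(fixed)  # [4000.0, 16000.0, 29400.0]
```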

Note that for typical conventional WI encoders, the short-term spectral snapshots are processed at pitch cycle
intervals. This is based on the assumption that for near-periodic speech it is sufficient to monitor the signal dynamics
at a pitch rate. Such a variable sampling rate poses some difficulty at the SEW/REW signal filtering stage, which
therefore calls for a special filtering procedure.

In the illustrative low-complexity WI (LCWI) encoder in accordance with the present invention, however, such
difficulties do not exist, since the AS is processed once per frame using a fixed-size FFT. The RS is represented by the
u-parameter, which measures the changes at pitch intervals (i.e., the pitch-lag correlation) while being updated at a
fixed rate.

In both conventional WI decoders and the illustrative LCWI decoder, the update rate is pitch dependent, both to
equalize the load and to make sure the outcome is not overly periodic (as it would be if the rate were too low). Moreover,
the spline transform and the IFFT of the illustrative LCWI coder are made pitch dependent by rounding up the pitch
value to the nearest radix-2 number. This advantageously reduces the variations in computational load across the pitch
range. Thus, given the current pitch, an update rate control (URC) procedure may be advantageously employed to
determine the synthesis subframe size over which the spectrum is reconstructed and the output signal is interpolated.
Since the u-parameter is illustratively transmitted at a fixed rate (e.g., twice per frame), it may be interpolated at the
decoder if a higher update rate is called for.

H. Low Complexity Quantization of the LP Parameters 

In the illustrative LCWI coder, a low-complexity vector quantizer (LCVQ) may be used in coding the LP parameters
to further reduce the computational load. The illustrative LCVQ is based on that described in detail in J. Zhou et al.,
"Simple fast vector quantization of the line spectral frequencies," Proc. ICSLP'96, Vol. 2, pp. 945-948, Oct. 1996, which
is hereby incorporated by reference as if fully set forth herein. (Note that the illustrative LCVQ described herein is not
necessarily specific to WI coders; it can also be advantageously used in other LP-based speech coders.)

In the illustrative LCVQ, the LP parameters are given in the form of 10 line spectral frequencies (LSFs). The
ten-dimensional LSF vectors are coded using 30 bits and 25 bits in the 1.2 kbps and 2.4 kbps coders, respectively. The
LSF vectors are commonly split into 3 sub-vectors, since a full-size 25- or 30-bit VQ is not practically implementable. In
particular, the sizes of the three LSF sub-vectors are (3, 3, 4) and (3, 4, 3) for the 1.2 kbps and 2.4 kbps coders,
respectively. The numbers of bits assigned to the three sub-VQs are (10, 10, 10) and (10, 10, 5), respectively. Each
sub-VQ may comprise a full-search VQ, meaning that a global search is performed over 1024 (or 32) codevector
candidates. However, in the illustrative LCWI coder in accordance with the present invention, the full-search VQs are
replaced by faster VQs as described below.
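The fast VQ replaces a full-search split VQ, which can be sketched as follows. The codebooks here are random placeholders rather than trained tables, and the function name is hypothetical.

```python
import random


def split_vq_encode(lsf, sizes, codebooks):
    """Code an LSF vector with independent full-search sub-VQs.

    sizes: sub-vector dimensions, e.g. (3, 3, 4) or (3, 4, 3).
    codebooks: one list of 2**bits codevectors per sub-vector.
    """
    indices, start = [], 0
    for dim, book in zip(sizes, codebooks):
        sub = lsf[start:start + dim]
        # Full search: squared error against every codevector in this sub-codebook.
        best = min(range(len(book)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(sub, book[i])))
        indices.append(best)
        start += dim
    return indices


# Usage with random (untrained, purely illustrative) codebooks for the 2.4 kbps split.
rng = random.Random(0)
SIZES, BITS = (3, 4, 3), (10, 10, 5)
books = [[[rng.random() for _ in range(d)] for _ in range(2 ** b)]
         for d, b in zip(SIZES, BITS)]
target = books[0][17] + books[1][500] + books[2][3]
print(split_vq_encode(target, SIZES, books))  # [17, 500, 3]
```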

Specifically, the illustrative fast VQ used herein is approximately 4 times faster than a full-search VQ. It uses the
same optimally trained codebook and achieves the same level of performance. In particular, it is based on the concept
of classified VQ, familiar to those skilled in the art. The main codebook is partitioned into several sub-codebooks
(classes). An incoming vector is first classified as belonging to a certain class. Then only that class and a few of its
neighbors are searched. The classification stage is carried out by yet another small-size VQ whose entries point to
their own classes. This codebook may be advantageously embedded in the main codebook, so no additional memory
locations are needed for the codevectors. However, some small increase (approximately 2%) in total memory may be
required for holding the pointers to the classes.
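A classified search in the spirit described above might look like the following sketch. Class heads are a subset of the main codevectors, so the classification VQ is embedded and only index lists are stored. Each codevector joins the class of its nearest head, and each class keeps a list of its two nearest neighboring classes. The stride used to pick heads, the neighbor count, and all names are assumptions. The sketch does not reproduce the training procedure or the speed-up reported in the cited work.

```python
import random


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(x, codebook, candidates):
    """Index (among candidates) of the codevector closest to x."""
    return min(candidates, key=lambda i: sq_dist(x, codebook[i]))


def build_classes(codebook, heads, num_neighbors=2):
    """Partition the main codebook into classes around embedded head codevectors.

    heads: indices of main-codebook entries reused as the small classification
    VQ, so only index lists (pointers) need extra memory.
    """
    classes = {h: [] for h in heads}
    for i, v in enumerate(codebook):
        classes[nearest(v, codebook, heads)].append(i)
    neighbors = {
        h: sorted((g for g in heads if g != h),
                  key=lambda g: sq_dist(codebook[h], codebook[g]))[:num_neighbors]
        for h in heads
    }
    return classes, neighbors


def classified_search(x, codebook, heads, classes, neighbors):
    """Classify x, then search only its class and a few neighboring classes.

    Returns (best index, number of codevectors actually compared)."""
    h = nearest(x, codebook, heads)           # stage 1: small classification VQ
    candidates = list(classes[h])
    for g in neighbors[h]:                    # stage 2: class plus its neighbors
        candidates.extend(classes[g])
    return nearest(x, codebook, candidates), len(candidates)


# Illustrative use: a random 10-bit, 3-dimensional codebook with 32 classes.
rng = random.Random(1)
codebook = [[rng.random() for _ in range(3)] for _ in range(1024)]
heads = list(range(0, 1024, 32))
classes, neighbors = build_classes(codebook, heads)
```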


I. An Illustrative Low-Complexity WI Coder

Figure 6 shows a block diagram of an LCWI coder in accordance with one illustrative embodiment of the present
invention. Specifically, Figure 6 shows encoder 61 with an illustrative block diagram thereof, decoder 62 with an
illustrative block diagram thereof, and the illustrative data flow between the encoder and the decoder. In particular, the
transmitted bit stream illustratively includes the indices of the quantized gain, LSFs, RS, AS, and pitch, identified as
G, L, R, A, and P, respectively.

1. An illustrative LCWI encoder 


In the illustrative encoder shown in Figure 6, an LP analysis is applied to the input speech (block 6104) and the
LCVQ described above is used to code the LSFs (block 6109). The input speech gain is computed by block 6103 at
a fixed rate of 4 times per frame. The gain is defined as the RMS of overlapping pitch-size subframes spaced uniformly
within the main frame. This makes the gain contour very smooth in stationary voiced speech. If the pitch cycle is too
short, two or more cycles may be used. This prevents skipping segments of possibly important gain cues. Four gains
are coded as one gain vector per frame. For the illustrative 2.4 kbps version of the encoder, 10 bits are assigned to
the gain. The gain vector is normalized by its RMS value, called the "super gain". A two-stage LCVQ is used (block
6109). First, the normalized vector is coded using a 6-bit VQ. Then, the logarithm (log) of the super gain is coded
differentially using a 4-bit quantizer. This coding technique increases the dynamic range of the quantizer and, at the
same time, allows it to represent short-term (i.e., within a vector) changes in the gain, representing, for example,
onsets. In the illustrative 1.2 kbps version of the encoder, no super gain is used and a single 8-bit four-dimensional
VQ is applied to the log-gains.
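The gain analysis and super-gain normalization can be sketched as follows. Three details are assumptions: each window is centered in one of four equal parts of the frame, a pitch shorter than 40 samples triggers the use of several cycles, and the function names are hypothetical. The 6-bit shape VQ and the 4-bit differential quantizer of the log super gain are omitted.

```python
import math


def frame_gains(x, frame_start, frame_len, pitch, count=4, min_window=40):
    """RMS gains of `count` pitch-size windows spaced uniformly over one frame.

    Each window is centered in one of `count` equal parts of the frame. It spans
    one pitch period, or several whole periods when the pitch is shorter than
    `min_window` samples (a hypothetical threshold), so neighboring windows may
    overlap. x must extend far enough on both sides of the frame.
    """
    cycles = max(1, math.ceil(min_window / pitch))
    win = cycles * pitch
    gains = []
    for i in range(count):
        center = frame_start + (2 * i + 1) * frame_len // (2 * count)
        start = max(0, center - win // 2)
        seg = x[start:start + win]
        gains.append(math.sqrt(sum(v * v for v in seg) / len(seg)))
    return gains


def split_super_gain(gains):
    """Return the "super gain" (RMS of the gain vector) and the normalized vector."""
    super_gain = math.sqrt(sum(g * g for g in gains) / len(gains))
    if super_gain == 0:
        return 0.0, list(gains)
    return super_gain, [g / super_gain for g in gains]
```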

The input is inverse-filtered using the LP coefficients to get the LP residual (block 6101). Pitch detection is done
on the residual to get the current pitch period (block 6102). The RS and the AS signals are processed as described
above. In block 6105, u-coefficients are generated, and in block 6110, the u-coefficients are coded by a two-dimensional
VQ using 5 and 6 bits for the illustrative 1.2 and 2.4 kbps coders, respectively. In the illustrative 2.4 kbps coder, the
AS baseband is coded by a ten-dimensional VQ using 7 bits (blocks 6106, 6107, 6111, and 6112). In the 1.2 kbps coder,
the AS is not processed and coded, but rather considered a constant, i.e., AS(K) = 1 for all K. Therefore, blocks
6106, 6107, 6111, and 6112 in Figure 6 do not exist in the illustrative 1.2 kbps coder.

2. An illustrative LCWI decoder

In the illustrative decoder shown in Figure 6, the received pitch value is used by the update rate control (URC) in
block 6209 to set the current update rate, that is, the number of subframes over which the entire interpolation and
synthesis process is to be performed. The pitch is interpolated in block 6205 using the previous value, and a value is
assigned to each subframe.

In block 6201, the super gain is differentially decoded and exponentiated; the normalized gain vector is decoded
and combined with the super gain; and the 4 gain values are interpolated into a longer vector if requested by the URC.
The LP coefficients are decoded once per frame and interpolated with the previous ones to obtain as many LP vectors
as requested by the URC (block 6202). An LP spectrum is obtained by applying DFT 6206 to the LP vector. Note that
this is advantageously a low-complexity DFT, since the input is only 10 samples. The DFT may be performed recursively
to avoid expensive trigonometric functions. Alternatively, an FFT could be used in combination with a
cubic-spline-based re-sampling.
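One way to compute the LP magnitude spectrum at the harmonic frequencies without per-term trigonometric calls is sketched below. A single complex exponential is computed, and every other twiddle factor is generated by repeated complex multiplication. The sign convention A(z) = 1 + a1 z^-1 + ... + a10 z^-10 and the function name are assumptions.

```python
import cmath
import math


def lp_magnitude_spectrum(a, num_harmonics, pitch):
    """|1 / A(e^{jw_k})| for w_k = 2*pi*k/pitch, k = 0 .. num_harmonics - 1.

    A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p (sign convention assumed).
    Only one complex exponential is evaluated; all other twiddle factors are
    produced recursively by complex multiplication.
    """
    step = cmath.exp(-2j * math.pi / pitch)  # the only trigonometric evaluation
    z_k = 1 + 0j                             # e^{-j w_k}, starting at k = 0
    out = []
    for _ in range(num_harmonics):
        acc, z_pow = 1 + 0j, 1 + 0j
        for coef in a:
            z_pow *= z_k                     # e^{-j w_k i}, built up term by term
            acc += coef * z_pow
        out.append(1.0 / abs(acc))
        z_k *= step                          # advance to the next harmonic
    return out
```

Rounding errors build up in the recursively generated twiddle factors. For very long recursions, these factors may need periodic renormalization to unit magnitude.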

In block 6203, the RS vector is decoded and interpolated if needed by the URC. Each u-value is mapped into an
expansion parameter set and a smoothed magnitude RS is generated (block 6207). A random phase is attached in
block 6210 to generate the complex RS.

In the illustrative 2.4 kbps coder, the AS is decoded and interpolated with the previous vector (block 6204). The
SS magnitude spectrum is obtained in block 6208 by subtracting the RS, and then the SS phase is added in block
6211. The complex RS and SS data are combined (block 6213), and the result is shaped by the LP spectrum and
scaled by the gain (block 6212). The result is applied to the waveform interpolation module (block 6214), which outputs
the coded speech. The waveform interpolation module may comprise the illustrative waveform interpolation process
of Figure 3, the illustrative waveform interpolation process of Figure 4, or any other waveform interpolation process in
accordance with the principles of the present invention.

Finally, a (preferably mild) post-filtering is applied in block 6215 to reshape the output coding noise. For example,
an LP-based post-filter similar to the one described in J.H. Chen et al., "Adaptive postfiltering for quality enhancement
of coded speech," IEEE Trans. Speech and Audio Processing, Vol. 3, 1995, pp. 59-71, may be used. Such a post-filter
enhances the LP formant pattern, thereby reducing the noise in between the formants. Alternatively, a post-filtering
operation could be included in the LP shaping stage (i.e., in block 6212), as is done in the WI coder described in W.B.
Kleijn et al., "A low-complexity waveform interpolation coder," cited above. However, to reduce the overall noise,
including that of the cubic-spline interpolator, the post-filter is preferably placed at the end of the synthesis process, as
shown in the illustrative embodiment of Figure 6.

J. Addendum 



For clarity of explanation, the illustrative embodiment of the present invention has been presented as comprising
individual functional blocks (including functional blocks labeled as "processors"). The functions these blocks represent
may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable
of executing software. For example, the functions of processors presented herein may be provided by a single shared
processor or by a plurality of individual processors. Moreover, use of the term "processor" herein should not be
construed to refer exclusively to hardware capable of executing software. Illustrative embodiments may comprise digital
signal processor (DSP) hardware, such as Lucent Technologies' DSP16 or DSP32C, read-only memory (ROM) for
storing software performing the operations discussed herein, and random access memory (RAM) for storing DSP
results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with
a general purpose DSP circuit, may also be provided. Any and all of these embodiments may be deemed to fall within
the meaning of the word "processor" as used herein.

Although a number of specific embodiments of this invention have been shown and described herein, it is to be
understood that these embodiments are merely illustrative of the many possible specific arrangements which can be
devised in application of the principles of the invention. Numerous and varied other arrangements can be devised in
accordance with these principles by those of ordinary skill in the art without departing from the invention. For example,
the use of terms such as "signal receiver," "spline coefficient generator," and "signal synthesizer," as used in the instant
claims herein, is intended to cover any mechanisms which perform the correspondingly identified function.



Claims 

1. A method of synthesizing a reconstructed speech signal based on encoded signals communicated via a
communications channel, the method comprising the steps of:

receiving at least two communicated signals, including a first communicated signal comprising a first set of
frequency domain parameters representing a first speech signal segment of a length equal to a first pitch-
period and a second communicated signal comprising a second set of frequency domain parameters
representing a second speech signal segment of a length equal to a second pitch-period;

generating at least two sets of spline coefficients, including a first set of spline coefficients which comprises
a spline representation of a time domain transformation of the first set of frequency domain parameters and
a second set of spline coefficients which comprises a spline representation of a time domain transformation
of the second set of frequency domain parameters;

synthesizing the reconstructed signal by interpolating between the spline representation of the time domain
transformation of the first set of frequency domain parameters and the spline representation of the time domain
transformation of the second set of frequency domain parameters.

2. The method of claim 1 wherein the spline representations comprise cubic spline representations. 

3. The method of claim 1 wherein the spline representations are based on cardinal spline representations. 

4. The method of claim 3 wherein the spline representations have a finite support basis function. 
5. The method of claim 4 wherein the spline representations comprise samples of the time domain transformation
corresponding thereto. 

6. The method of claim 1 wherein the first pitch period and the second pitch period are unequal and wherein the step 
of synthesizing the reconstructed signal comprises the step of modifying the time scale of at least the spline rep- 
resentation of the time domain transformation of the second set of frequency domain parameters. 

7. The method of claim 1 further comprising the step of performing an inverse transform on the first and second sets
of frequency domain parameters to produce corresponding first and second sets of time domain parameters, and 
wherein the generating step is based on said first and second sets of time domain parameters. 

8. The method of claim 7 further comprising the step of zero-padding the first and second sets of frequency domain
parameters to a fixed radix-2 size prior to the step of performing said inverse transform.

9. The method of claim 8 wherein said inverse transform comprises an IFFT. 

10. The method of claim 1 wherein the step of synthesizing the reconstructed signal comprises the steps of: 

generating a set of interpolated spline coefficients which comprises a spline representation of a continuous 
time domain signal; and 

generating the reconstructed signal based on the set of interpolated spline coefficients. 

11. The method of claim 10 wherein the reconstructed signal is generated by sampling the continuous time domain
signal at a non-uniform rate. 
12. The method of claim 11 wherein the non-uniform rate is determined based on the first and second pitch periods.
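Here is a sketch of claims 10 to 12. It rests on several assumptions that the claims do not state:

- Both periods are already on a common normalized grid of M cardinal-spline coefficients.
- The pitch period varies linearly from the first value to the second across an output frame.
- The two coefficient sets are mixed linearly.

Each output sample evaluates the interpolated spline at a phase that advances by 1/P(n) of a period. The continuous signal is therefore sampled at a non-uniform rate set by the two pitch periods. All names and parameters are hypothetical.

```python
import math


def catmull_rom(coeffs, t):
    """Periodic Catmull-Rom spline through coeffs, evaluated at position t."""
    n = len(coeffs)
    i = math.floor(t)
    u = t - i
    p0, p1, p2, p3 = (coeffs[(i + d) % n] for d in (-1, 0, 1, 2))
    return p1 + 0.5 * u * ((p2 - p0)
                           + u * ((2 * p0 - 5 * p1 + 4 * p2 - p3)
                                  + u * (3 * (p1 - p2) + p3 - p0)))


def synthesize_frame(c1, c2, pitch1, pitch2, frame_len, phase=0.0):
    """Render frame_len output samples moving from period c1 to period c2.

    c1, c2: coefficient sets on a common normalized grid (one period each).
    pitch1, pitch2: pitch periods in output samples at frame start and end.
    phase: starting position within the period, as a fraction in [0, 1).
    Returns the samples and the phase to carry into the next frame.
    """
    if len(c1) != len(c2):
        raise ValueError("coefficient sets must share a common grid")
    grid = len(c1)
    out = []
    for n in range(frame_len):
        alpha = n / frame_len  # hypothetical linear interpolation weight
        # Interpolated spline coefficients
        coeffs = [(1.0 - alpha) * a + alpha * b for a, b in zip(c1, c2)]
        # Sample the continuous spline at the current phase
        out.append(catmull_rom(coeffs, phase * grid))
        # Non-uniform step: the phase advances by 1/P(n) periods per output sample
        pitch = (1.0 - alpha) * pitch1 + alpha * pitch2
        phase = (phase + 1.0 / pitch) % 1.0
    return out, phase
```

The spline is linear in its coefficients, so a practical implementation might mix only the four coefficients the evaluator reads rather than the whole set. The full mix is kept here to mirror the wording of claim 10.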

13. A speech decoder which synthesizes a reconstructed speech signal based on encoded signals communicated via 
a communications channel, the decoder comprising: 

a signal receiver which receives at least two communicated signals, including a first communicated signal
comprising a first set of frequency domain parameters representing a first speech signal segment of a length
equal to a first pitch-period and a second communicated signal comprising a second set of frequency domain
parameters representing a second speech signal segment of a length equal to a second pitch-period;

a spline coefficient generator which generates at least two sets of spline coefficients, including a first set of
spline coefficients which comprises a spline representation of a time domain transformation of the first set of
frequency domain parameters and a second set of spline coefficients which comprises a spline representation
of a time domain transformation of the second set of frequency domain parameters;

a signal synthesizer which synthesizes the reconstructed signal by interpolating between the spline represen- 
tation of the time domain transformation of the first set of frequency domain parameters and the spline repre- 
sentation of the time domain transformation of the second set of frequency domain parameters. 

14. The decoder of claim 13 wherein the spline representations comprise cubic spline representations.

15. The decoder of claim 13 wherein the spline representations are based on cardinal spline representations. 

16. The decoder of claim 15 wherein the spline representations have a finite support basis function. 

17. The decoder of claim 16 wherein the spline representations comprise samples of the time domain transformation 
corresponding thereto. 

18. The decoder of claim 13 wherein the first pitch period and the second pitch period are unequal and wherein the
signal synthesizer comprises means for modifying the time scale of at least the spline representation of the time
domain transformation of the second set of frequency domain parameters.

19. The decoder of claim 13 further comprising an inverse transform performed on the first and second sets of frequency
domain parameters to produce corresponding first and second sets of time domain parameters, and wherein the
spline coefficient generator is based on said first and second sets of time domain parameters.

20. The decoder of claim 19 further comprising means for zero-padding the first and second sets of frequency domain
parameters to a fixed radix-2 size for use by said inverse transform. 

21. The decoder of claim 20 wherein said inverse transform comprises an IFFT.

22. The decoder of claim 13 wherein the signal synthesizer comprises: 

means for generating a set of interpolated spline coefficients which comprises a spline representation of a 
continuous time domain signal; and 

means for generating the reconstructed signal based on the set of interpolated spline coefficients.

23. The decoder of claim 22 wherein the reconstructed signal is generated by sampling the continuous time domain
signal at a non-uniform rate.

24. The decoder of claim 23 wherein the non-uniform rate is determined based on the first and second pitch periods.